MOTChallenge 2015:

Towards a Benchmark for Multi-Target Tracking 

Laura Leal-Taixe*, Anton Milan*, Ian Reid, Stefan Roth, and Konrad Schindler 


Abstract —In the recent past, the computer vision community has developed centralized benchmarks for the performance evaluation of 
a variety of tasks, including generic object and pedestrian detection, 3D reconstruction, optical flow, single-object short-term tracking, 
and stereo estimation. Despite potential pitfalls of such benchmarks, they have proved to be extremely helpful to advance the state of 
the art in the respective area. Interestingly, there has been rather limited work on the standardization of quantitative benchmarks for 
multiple target tracking. One of the few exceptions is the well-known PETS dataset [20], targeted primarily at surveillance applications. 
Despite being widely used, it is often applied inconsistently, for example by using different subsets of the available data, different
ways of training the models, or differing evaluation scripts. This paper describes our work toward a novel multiple object tracking
benchmark aimed to address such issues. We discuss the challenges of creating such a framework, collecting existing and new data, 
gathering state-of-the-art methods to be tested on the datasets, and finally creating a unified evaluation system. With MOTChallenge 
we aim to pave the way toward a unified evaluation framework for a more meaningful quantification of multi-target tracking. 

Index Terms —multiple people tracking, benchmark, evaluation metrics, dataset 
1 Introduction

Evaluating and comparing multi-target tracking methods
is not trivial for numerous reasons (cf. e.g. [37]).
First, unlike for other tasks, such as image denoising, 
the ground truth, i.e. the perfect solution one aims to 
achieve, is difficult to define clearly. Partially visible, 
occluded, or cropped targets, reflections in mirrors or 
windows, and objects that very closely resemble targets 
all impose intrinsic ambiguities, such that even humans 
may not agree on one particular ideal solution. Second, a
number of different evaluation metrics with free parameters
and ambiguous definitions often lead to inconsistent
quantitative results across the literature. Finally, the lack
of pre-defined test and training data makes it difficult to 
compare different methods fairly. 

Multi-target tracking is a crucial problem in scene 
understanding, which, in contrast to other research areas, 
still lacks large-scale benchmarks. We believe that a 
unified evaluation platform that allows participants to 
submit not only their own tracking methods, but also 
their own data, including video and annotations, as well 
as propose new evaluation methodologies that can be 
applied instantaneously to all previous approaches, can 
bring a vast benefit to the computer vision community. 

To that end we develop the MOTChallenge benchmark, 
consisting of three main components: (1) a publicly available
dataset, (2) a centralized evaluation method, and
(3) an infrastructure that allows for crowdsourcing of
new data, new evaluation methods and even new annotations.
The first release of the dataset contains a total of
22 sequences, half for training and half for testing, with 
a total of 11286 frames or 996 seconds of video. Camera 
calibration is provided for 4 of those sequences to enable 
3D real-world coordinate tracking. We also provide pre-computed
object detections, annotations, and a common
evaluation method for all datasets, so that all results 
can be compared in a fair way. The final goal is to 
collect sequences from all researchers who are willing 
to contribute to the benchmark, enabling an update of 
the data, new evaluation tools, new annotations, etc. to 
be made available yearly. 

We anticipate two ways of submitting to the MOTChallenge
benchmark: (1) year-round, or (2) submissions to a
specific workshop or challenge, which is to be held once 
a year. The purpose of the former is to keep track of 
state-of-the-art methods submitted at major conferences 
and journals, to allow for a fair comparison between 
methods by ensuring that all are using the same datasets 
and the same evaluation methods. The latter follows 
the well-known format of yearly challenges that have 
been very successful in other areas, e.g., in the PASCAL
VOC series [19] or the ImageNet competitions [42]. These
challenges and workshops provide a way to track and 
discuss the progress and innovations of state-of-the-art 
methods presented over the year. The first workshop [2] 
organized on the MOTChallenge benchmark took place 
in early 2015 in conjunction with the Winter Conference 
on Applications of Computer Vision (WACV). 
Goals. This paper has three main goals:

1) To discuss the challenges of creating a multi-object 
tracking benchmark; 

2) to analyze current datasets and evaluation methods;

3) to bring forward the strengths and weaknesses of 
state-of-the-art multi-target tracking methods. 

The benchmark with all datasets, current ranking and 
submission guidelines can be found at: 

http://www.motchallenge.net/ 

1.1 Related work 

Benchmarks and challenges. In the recent past, the 
computer vision community has developed centralized 
benchmarks for numerous tasks including object detection
[19], pedestrian detection [17], 3D reconstruction
[45], optical flow and visual odometry [22], single-object
short-term tracking [28], and stereo estimation
[22], [43]. Despite potential pitfalls of such benchmarks
(e.g. [48]), they have proved to be extremely helpful to
advance the state of the art in the respective area. For 
multiple target tracking, in contrast, there has been very 
limited work on standardizing quantitative evaluation. 

One of the few exceptions is the well-known PETS
dataset [20], targeted primarily at surveillance applications.
The 2009 version consisted of 3 subsets: S1, targeted
at person count and density estimation; S2, targeted
at people tracking; and S3, targeted at flow analysis
and event recognition. The easiest sequence for tracking
(S2L1) consisted of a scene with few pedestrians, and
for that sequence state-of-the-art methods perform extremely
well, with accuracies of over 90% given a good
set of initial detections [24], [36], [55]. Methods then
moved to tracking on the hardest of the sequences (those
with the highest crowd density), but hardly ever on the complete
dataset. Even for this widely used benchmark, we observe
that tracking results are commonly obtained in an
inconsistent fashion: using different subsets of
the available data, different detection inputs, inconsistent
model training that is often prone to overfitting, and
varying evaluation scripts. Results are thus not easily
comparable. So the question that arises here is: Are
these sequences already too easy for current tracking 
methods, are methods simply overfitted, or are they 
poorly evaluated? 

The PETS team organized a workshop approximately 
once a year to which researchers could submit their 
results, and methods were evaluated under the same 
conditions. Although this was indeed a fair comparison, 
the fact that submission was only once a year meant that 
the use of this benchmark for high impact conferences 
like ICCV or CVPR still remained an issue. 

A well-established and useful way of organizing
datasets is through standardized challenges. These are
usually in the form of web servers that host the data and
through which results are uploaded by the users. Results
are then computed in a centralized way by the server
and afterwards presented online to the public, making
comparison with any other method immediately possible.
There are several datasets organized in this fashion:
the Labeled Faces in the Wild [25] for unconstrained face
recognition, the PASCAL VOC [19] for object detection,
the ImageNet large scale visual recognition challenge
[42], or the Reconstruction Meets Recognition Challenge
(RMRC).

Recently, the KITTI benchmark [22] was introduced
for challenges in autonomous driving, which included
stereo/flow, odometry, road and lane estimation, object
detection and orientation estimation, as well as tracking.
Some of the sequences include crowded pedestrian
crossings, making the dataset quite challenging, but the 
camera position is always the same for all sequences (at 
a car's height). 

With the MOTChallenge benchmark, we aim to increase 
the difficulty by including a variety of sequences filmed 
from different viewpoints, with different lighting conditions,
and different levels of crowd density. In addition
to other existing and new data, we include sequences 
from both PETS and KITTI datasets. The real challenge 
we see is not to perform well on an individual sequence, 
but rather to perform well on a diverse set of sequences. 

Another work that is worth mentioning is [4], in which
the authors collect a very large amount of data with 42 
million pedestrian trajectories. Since annotation of such 
a large collection of data is infeasible, they use a denser 
set of cameras to create the "ground truth" trajectories. 
Though we do not aim at collecting such a large amount 
of data, the goal of our benchmark is somewhat similar: 
to push research in tracking forward by generalizing the 
test data to a larger set that is highly variable and hard 
to overfit. 

Evaluation. A critical point with any dataset is how 
to measure the performance of the algorithms. In the 
case of multiple object tracking, the CLEAR metrics
[27] have emerged as one of the standard measures.
By measuring the intersection over union of bounding 
boxes and matching those from annotations and results, 
measures of accuracy and precision can be computed. 
Precision measures how well the persons are localized, 
while accuracy evaluates how many distinct errors such 
as missed targets, ghost trajectories, or identity switches 
are made. 
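
As a rough illustration of how the accuracy measure aggregates these error types, the sketch below uses the commonly cited CLEAR MOT form, MOTA = 1 - (FN + FP + IDSW) / GT, where FN are missed targets, FP are false positives (ghost detections), IDSW are identity switches and GT is the number of annotated boxes. The function name and the example counts are hypothetical; the definitions actually used by the benchmark are the formal ones referred to in Section 5.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_boxes):
    """Multiple object tracking accuracy in its commonly cited CLEAR MOT form.

    All arguments are counts accumulated over every frame of a sequence.
    """
    if num_gt_boxes <= 0:
        raise ValueError("num_gt_boxes must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_boxes


# Hypothetical error counts for a sequence with 1000 annotated boxes.
example_mota = mota(false_negatives=300, false_positives=150,
                    id_switches=20, num_gt_boxes=1000)
print(f"MOTA = {example_mota:.1%}")
```

Note that under this form the score becomes negative as soon as the number of errors exceeds the number of annotated boxes.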

Another set of measures that is widely used in the
tracking community is that of [33]. There are three
widely used metrics introduced in that work: mostly
tracked, mostly lost, and partially tracked pedestrians.
These numbers give a very good intuition on the performance
of the method. We refer the reader to Section 5
for more formal definitions.
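
The three categories are usually derived from the fraction of a ground-truth trajectory that the tracker covers. The sketch below assumes the thresholds commonly used in the literature (at least 80% for mostly tracked, less than 20% for mostly lost); these thresholds and the function name are assumptions made for illustration, and the formal definitions remain those of Section 5.

```python
def classify_trajectory(tracked_frames, total_frames,
                        mt_threshold=0.8, ml_threshold=0.2):
    """Label one ground-truth trajectory by the fraction of its life span
    during which it was successfully tracked (thresholds are assumptions)."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    ratio = tracked_frames / total_frames
    if ratio >= mt_threshold:
        return "mostly tracked"
    if ratio < ml_threshold:
        return "mostly lost"
    return "partially tracked"


# Hypothetical coverage counts for three trajectories of 100 frames each.
labels = [classify_trajectory(n, 100) for n in (90, 50, 10)]
print(labels)
```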

A key parameter in both families of metrics is the
intersection-over-union threshold, which determines if
a bounding box is matched to an annotation or not. It
is fairly common to observe methods compared under
different thresholds, varying from 25% to 50%. There are
often many other variables and implementation details
that differ between evaluation scripts, but which may
affect results significantly.
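
To make the effect of the threshold concrete, the following minimal sketch computes the intersection over union of two boxes given as (left, top, width, height) and shows a pair that counts as a match at 25% but not at 50%. The box values and function names are hypothetical and chosen only for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (left, top, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def is_match(box_a, box_b, threshold):
    """A result box is matched to an annotation if the IoU reaches the threshold."""
    return iou(box_a, box_b) >= threshold


annotation = (100.0, 50.0, 40.0, 100.0)   # hypothetical ground-truth box
result = (115.0, 70.0, 40.0, 100.0)       # hypothetical tracker output
overlap = iou(annotation, result)
print(f"IoU = {overlap:.3f}")
print("match at 25%:", is_match(annotation, result, 0.25))
print("match at 50%:", is_match(annotation, result, 0.50))
```

The same tracker output is thus counted as a correct match by one evaluation script and as a miss plus a false positive by another, which is why a fixed threshold matters for comparability.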

It is therefore clear that standardized benchmarks
are the only way to compare methods in a fair and
principled way. Using the same ground truth data and
evaluation methodology is the only way to guarantee
that the only part being evaluated is the tracking method
that delivers the results. This is the main goal behind this
paper and behind the MOTChallenge benchmark.

2 Benchmark Submission 

Our benchmark consists of the database and evaluation 
server on one hand, and the website as the user interface 
on the other. It is open to everyone who respects the
submission policies (see next section). Before participating,
every user is required to create an account, if
possible providing an institutional and not a generic e-mail
address.¹ After registering, the user can create a
new tracker with a unique name and enter all additional
details. It is mandatory to indicate

• the challenge or benchmark category in which the 
tracker will be participating, 

• the full name and a brief description of the method 
including the parameters used, 

• whether the method operates online or on a batch 
of frames and whether the source code is publicly 
available, 

• the total runtime in seconds for computing the results
on all sequences and the hardware used, and

• whether only the provided or also external training 
and detection data were used. 

After entering all details, it is possible to submit the
results in the format described in Sec. 3.4. The tracking
results will be automatically evaluated and appear on the
user's profile. They will not be displayed in the public 
ranking table. The user can then decide at any point in 
time to make the results public. Note that the results can 
be published anonymously, e.g. to enable a blind review 
process for a corresponding paper. In this case, we ask 
to provide the venue and the paper ID or a similar 
unique reference. We request that a proper reference to 
the method's description is added upon acceptance of 
the paper. In case of rejection, an anonymous entry may 
also be removed from the database. Anonymous entries 
will also be removed after six months of inactivity. 

The tracker's meta information, such as the description or
project page, can be edited at any time. Visual results
of all public submissions, as well as annotations and
detections, can be viewed on a dedicated visualization
page.²

1. For accountability and to prevent abuse by using several email 
accounts. 

2. http://motchallenge.net/vis/ 


2.1 Submission policy 

The main goal of this benchmark is to provide a platform 
that allows for an objective performance comparison of 
multiple target tracking approaches on real-world data. 
Therefore, we introduce a few simple guidelines that 
must be followed by all participants. 

Training. Ground truth is only provided for the training
sequences. It is the participant's own responsibility
to find the best setting using only the training data. The 
use of additional training data must be indicated during 
submission and will be visible in the public ranking 
table. The use of ground truth labels on the test data 
is strictly forbidden. This or any other misuse of the 
benchmark will lead to the deletion of the participant's 
account and their results. 

Detections. We also provide a unique set of detections
(see Sec. 3.3) for each sequence. We expect all tracking-by-detection
algorithms to use the given detections. In
case the user wants to present results with another set 
of detections or is not using detections at all, this should 
be clearly stated during submission and will also be 
displayed in the results table. 

Submission frequency. Generally, we expect one single
submission for every tracking approach. If for any
reason, the user needs to re-compute and re-submit the
results (e.g. due to a bug discovered in the implementation),
they may do so after a waiting period of
72 hours after the last submission. This policy should
discourage the use of the benchmark server for training 
and parameter tuning on the test data. The number of 
submissions is counted and displayed for each method. 
Under no circumstances must anyone create a second 
account and attempt to re-submit in order to bypass 
the waiting period. This may lead to the deletion of the
account and the exclusion of the user from participating in the
benchmark.


2.2 Challenges 

Besides the main benchmarks (2D MOT 2015, 3D MOT
2015), we plan to organize multi-target tracking
challenges on a regular basis, similar in spirit to the
widely known PASCAL VOC series [19], or the ImageNet
competitions [42]. The main differences to the
main benchmark are:

• The dataset is typically smaller, but potentially more 
challenging. 

• There is a fixed submission deadline for all participants.

• The results are revealed and the winners awarded 
at a corresponding workshop. 

The first edition of our series was the WACV 2015 
Challenge that consisted of six new outdoor sequences 
with both moving and static cameras. The results were 
presented at the BMTT Workshop [2] held in conjunction
with WACV 2015. 
Training sequences

Name             FPS  Resolution  Length        Tracks  Boxes  Density  3D   Camera  Viewpoint  Weather  Source
TUD-Stadtmitte   25   640x480     179 (00:07)   10      1156   6.5      yes  static  medium     cloudy
TUD-Campus       25   640x480     71 (00:03)    8       359    5.1      no   static  medium     cloudy
PETS09-S2L1      7    768x576     795 (01:54)   19      4476   5.6      yes  static  high       cloudy   [20]
ETH-Bahnhof      14   640x480     1000 (01:11)  171     5415   5.4      yes  moving  low        cloudy   [18]
ETH-Sunnyday     14   640x480     354 (00:25)   30      1858   5.2      yes  moving  low        sunny    [18]
ETH-Pedcross2    14   640x480     840 (01:00)   133     6263   7.5      no   moving  low        sunny    [18]
ADL-Rundle-6     30   1920x1080   525 (00:18)   24      5009   9.5      no   static  low        cloudy   new
ADL-Rundle-8     30   1920x1080   654 (00:22)   28      6783   10.4     no   moving  medium     night    new
KITTI-13         10   1242x375    340 (00:34)   42      762    2.2      no   moving  medium     sunny    [22]
KITTI-17         10   1242x370    145 (00:15)   9       683    4.7      no   static  medium     sunny    [22]
Venice-2         30   1920x1080   600 (00:20)   26      7141   11.9     no   static  medium     sunny    new
Total training                    5503 (06:29)  500     39905  7.3

Testing sequences

Name             FPS  Resolution  Length        Tracks  Boxes  Density  3D   Camera  Viewpoint  Weather  Source
TUD-Crossing     25   640x480     201 (00:08)   13      1102   5.5      no   static  medium     cloudy   [6]
PETS09-S2L2      7    768x576     436 (01:02)   42      9641   22.1     yes  static  high       cloudy   [20]
ETH-Jelmoli      14   640x480     440 (00:31)   45      2537   5.8      yes  moving  low        sunny    [18]
ETH-Linthescher  14   640x480     1194 (01:25)  197     8930   7.5      yes  moving  low        sunny    [18]
ETH-Crossing     14   640x480     219 (00:16)   26      1003   4.6      no   moving  low        cloudy   [18]
AVG-TownCentre   2.5  1920x1080   450 (03:45)   226     7148   15.9     yes  static  high       cloudy   [10]
ADL-Rundle-1     30   1920x1080   500 (00:17)   32      9306   18.6     no   moving  medium     sunny    new
ADL-Rundle-3     30   1920x1080   625 (00:21)   44      10166  16.3     no   static  medium     sunny    new
KITTI-16         10   1242x370    209 (00:21)   17      1701   8.1      no   static  medium     sunny    [22]
KITTI-19         10   1242x374    1059 (01:46)  62      5343   5.0      no   moving  medium     sunny    [22]
Venice-1         30   1920x1080   450 (00:15)   17      4563   10.1     no   static  medium     sunny    new
Total testing                     5783 (10:07)  721     61440  10.6







TABLE 1: Overview of the sequences currently included in the benchmark. 
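
The derived columns of Table 1 can be checked with a few lines of arithmetic. The sketch below assumes that Density is the mean number of annotated boxes per frame and that the time in the Length column is the frame count divided by the frame rate, rounded to the nearest second; both are inferences from the listed numbers rather than stated definitions, and the helper names are hypothetical.

```python
# (name, fps, frames, boxes) copied from Table 1 for three sequences.
SEQUENCES = [
    ("TUD-Stadtmitte", 25, 179, 1156),
    ("PETS09-S2L1", 7, 795, 4476),
    ("PETS09-S2L2", 7, 436, 9641),
]


def density(boxes, frames):
    """Mean number of annotated boxes per frame (assumed meaning of Density)."""
    return boxes / frames


def duration(frames, fps):
    """Sequence length as mm:ss, rounded to the nearest second (assumed convention)."""
    seconds = round(frames / fps)
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


for name, fps, frames, boxes in SEQUENCES:
    print(f"{name}: {duration(frames, fps)}, density {density(boxes, frames):.1f}")
```

For these three rows the computed values agree with the table (00:07 and 6.5, 01:54 and 5.6, 01:02 and 22.1).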


3 Datasets 

One of the key aspects of any benchmark is data collection.
The goal of MOTChallenge is not merely to compile yet
another dataset with completely new data, but rather to:
(1) create a common framework to test tracking methods
on; (2) gather existing and new challenging sequences
with very different characteristics (frame rate, pedestrian
density, illumination or point of view) in order to
challenge researchers to develop more general tracking
methods that can deal with all types of sequences. In
Table 1 we show an overview of the sequences included
in the benchmark.

3.1 2D MOT 2015 sequences

We have compiled a total of 22 sequences, of which we
use half for training and half for testing. The annotations
of the testing sequences will not be released in order
to avoid (over)fitting of the methods to the specific
sequences. Nonetheless, the test data contains over 10
minutes of footage and 61440 annotated bounding boxes;
it is therefore hard for algorithms to overtune on such a
large amount of data. This is one of the major strengths
of the benchmark.

The sequences are very different from each other; we can
classify them according to:

• Moving or static camera: the camera can be held by
a person [2], placed on a stroller [18] or on a car
[22], or can be positioned fixed in the scene.

• Viewpoint: the camera can overlook the scene from
a high position, a medium position (at pedestrian's
height), or at a low position.

• Weather: the weather conditions in which the
sequence was taken are reported in order to get
an estimate of the illumination conditions of the
scene. Sunny sequences may contain shadows
and saturated parts of the image, while the
night sequence contains a lot of motion blur,
making pedestrian detection and tracking rather
challenging. Cloudy sequences on the other hand
contain fewer of those artifacts.

We divided the sequences into training and testing in
order to have a balanced distribution, as we can see in
Fig. 1.

3.1.1 New sequences 

We introduce 6 new challenging sequences, 4 filmed 
from a static camera and 2 from a moving camera held 
at pedestrian's height. Three of the sequences are particularly
difficult: a night sequence filmed from a moving
camera and two outdoor sequences with a high density 
of pedestrians. The moving camera together with the low 
illumination creates a lot of motion blur, making this 
sequence extremely challenging. In the future, we will 
include further sequences captured on rainy or foggy 
days and evaluate how methods perform under those 
special conditions. A special challenge including only
these 6 new sequences was held at the 1st Workshop on
Benchmarking Multi-Target Tracking [2]. The best performing
algorithm reached a MOTA (tracking accuracy)
of 12.7%, showing how challenging these new sequences
are.³

Fig. 1: Comparison histogram between training and testing sequences of (a) static vs. moving camera, (b) camera
viewpoint: low, medium or high, (c) weather conditions: cloudy, sunny, or night.

3.2 3D MOT 2015 sequences 

A pedestrian's 3D position is typically obtained by projecting
the 2D position of the feet of the person into
the 3D world, e.g., by using a homography between 
the image plane and the ground plane. The bottom- 
center point of the bounding box is commonly chosen to 
represent the position of the feet of the pedestrian, but 
this may not be particularly accurate, if the bounding 
box is not placed very tightly around the pedestrian's 
silhouette, or if the limbs are extended asymmetrically. 
By the nature of projective geometry, even slight 2D 
misplacements can cause large 3D errors. It is therefore
clear that obtaining accurate 3D information only from
bounding boxes is a challenging task.
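
The sensitivity described above can be illustrated with a small sketch that maps the bottom-center point of a bounding box to the ground plane with a 3x3 homography. The matrix `H_EXAMPLE` is entirely hypothetical (it is not the calibration of any benchmark sequence) and was chosen only so that image rows close to the horizon map to large distances; the helper names are likewise illustrative.

```python
import math


def foot_point(box):
    """Bottom-center of a box given as (left, top, width, height)."""
    left, top, width, height = box
    return (left + width / 2.0, top + height)


def apply_homography(H, point):
    """Map an image point to ground-plane coordinates with a 3x3 homography."""
    x, y = point
    xs = H[0][0] * x + H[0][1] * y + H[0][2]
    ys = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity")
    return (xs / w, ys / w)


# Hypothetical image-to-ground homography (metres); horizon at image row 100.
H_EXAMPLE = [[0.02, 0.0, -6.4],
             [0.0, 0.0, 10.0],
             [0.0, 0.01, -1.0]]


def ground_shift(H, box, extra_height):
    """Ground-plane displacement caused by making the box extra_height pixels taller."""
    left, top, width, height = box
    a = apply_homography(H, foot_point(box))
    b = apply_homography(H, foot_point((left, top, width, height + extra_height)))
    return math.hypot(b[0] - a[0], b[1] - a[1])


# The same 2-pixel annotation error, once close to the camera and once far away.
near_error = ground_shift(H_EXAMPLE, (300.0, 300.0, 60.0, 180.0), 2.0)
far_error = ground_shift(H_EXAMPLE, (310.0, 90.0, 10.0, 30.0), 2.0)
print(f"near: {near_error:.3f} m, far: {far_error:.3f} m")
```

With this hypothetical calibration the 2-pixel shift moves the near pedestrian by roughly a centimetre, but the distant one by several metres.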

In this section, we detail how the 3D information 
is obtained for both static and moving cameras, and 
discuss whether current calibration and annotation pairs 
are accurate enough for reliable 3D tracking. For the 
sequences with a moving camera we show that these 
errors are too large for tracking purposes, and therefore 
we argue not to include those sequences in the 3D 
benchmark. We thus limit the 3D category to a few 
available 3D sequences with a static camera, but plan to 
extend the number of 3D sequences in the near future. 

3.2.1 Static camera sequences 

For the 4 sequences filmed using a static camera,
AVG-TownCentre, PETS09-S2L1, PETS09-S2L2 and TUD-Stadtmitte,
the calibration files provided by the respective sources
(e.g. [20]) are used to compute a 2D homography between the
image plane and the ground plane. All z coordinates are
set to 0, indicating the position of the feet of the pedestrian.
In order to measure the accuracy of the calibration
for each sequence, we use the manually annotated trajectories
and plot the velocities of the pedestrians at each
frame. Realistic walking speeds range from 0 to 3 m/s,
with a mean comfortable walking speed of 1.4 m/s. This
is confirmed by the distribution that we see in Fig. 2(a).
The other sequences (Figs. 2(b)-2(d)) have a few speeds
in the range of 3 to 10 m/s. These are not real speeds,
since there are no running pedestrians in the sequences.
These outliers are largely due to projective geometry. For
example, variations in the size of the bounding box can
introduce artificial shifts in 2D that greatly affect the 3D
position in the scene.

3. The challenge results are available at http://motchallenge.net/results/WACV_2015_Challenge/

The bottom row of Fig. 2 shows the mean speeds per
pedestrian and sequence. As we can see, most pedestrians
walk at a speed between 1 and 1.5 m/s, hence we
can conclude that the calibration is accurate enough for
tracking.
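
The velocity analysis can be sketched as follows: per-frame speeds are obtained from consecutive ground-plane positions and the frame rate, and speeds above the realistic walking range of 3 m/s are flagged. The trajectory below is synthetic and the function names are hypothetical; only the 3 m/s bound and the frame rate of 7 fps (PETS09, Table 1) are taken from the text.

```python
import math


def step_speeds(track, fps):
    """Speeds in m/s between consecutive entries of a trajectory.

    track is a list of (frame, x, y) with x and y in metres, sorted by frame.
    """
    result = []
    for (f0, x0, y0), (f1, x1, y1) in zip(track, track[1:]):
        dt = (f1 - f0) / fps
        if dt <= 0:
            raise ValueError("frames must be strictly increasing")
        result.append(math.hypot(x1 - x0, y1 - y0) / dt)
    return result


def mean_speed(track, fps):
    """Mean of the per-step speeds, or 0.0 for trajectories with a single entry."""
    values = step_speeds(track, fps)
    return sum(values) / len(values) if values else 0.0


# Synthetic trajectory at 7 fps: 0.2 m per frame, then a 1 m jump as it could
# be caused by a shifted bounding box.
track = [(1, 0.0, 0.0), (2, 0.2, 0.0), (3, 0.4, 0.0), (4, 1.4, 0.0)]
speeds = step_speeds(track, fps=7)
implausible = [v for v in speeds if v > 3.0]
print([round(v, 2) for v in speeds], round(mean_speed(track, fps=7), 2))
```

A single shifted position is enough to produce a speed well outside the walking range while leaving the mean speed of the pedestrian only moderately affected, which matches the difference between the top and bottom rows of Fig. 2.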

We can further analyze the spurious high speeds that 
we observe in some sequences by plotting a speed 
distribution in image space as shown in the top row
of Fig. 3. For PETS09-S2L2, we can see that most artifacts
are concentrated on the bottom-right part of the
image, where pedestrians leave the scene. The fact that 
a leaving pedestrian is cropped by the image border 
makes the bounding box around it thinner (following 
the annotation policy of PETS). Since the bottom-center 
of the bounding box is the 2D position used to obtain the 
3D information, the 2D position is likely shifted away 
from the real position of the pedestrian's feet. In the 
case of AVG-TownCentre, we can see some points on 
the image where unusually high speeds are observed. 
These are typically far away from the camera and present 
at the beginning or the end of the sequence, where 
correct annotation is difficult. This also accounts for the
peaks of mean speed in Fig. 2(g), which belong to two
pedestrians observed for 2 frames and are far away from
the camera. Their position fluctuation is a simple artifact 
of variability in the bounding box placement. Finally, 
for the TUD-Stadtmitte sequence, we clearly see that 
high velocities are concentrated in the part of the image 
that is far away from the camera, and therefore also far
away from the points used for calibration. The bounding
box shifts in that area have a bigger impact on the 3D
position. Avoiding the effects discussed would require
new "3D aware" annotations for each sequence, which
we leave for future work.

Fig. 2: Top row: Pedestrian speed histograms per sequence. Bottom row: Mean speed per pedestrian per sequence.
Bottom-row panels: (e) PETS09-S2L1, (f) PETS09-S2L2, (g) AVG-TownCentre, (h) TUD-Stadtmitte.

3.2.2 Moving camera sequences 

For the sequences with moving cameras, the authors [18]
provide one file for each image, containing the calibration
of the left camera, which allows us to backproject
the feet of the person to the world coordinate system 
and to find the 3D position by intersecting the ray with 
the estimated ground plane. The error of this calibration 
increases significantly as pedestrians move away from 
the camera, which makes tracking of far away objects 
very imprecise. 
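
A minimal sketch of this back-projection is given below. It works in camera coordinates (x right, y down, z forward) with a pinhole model and a ground plane written as n · X = d; the intrinsics, the camera height of 1 m and all function names are hypothetical and do not correspond to the calibration files of [18].

```python
def backproject(pixel, fx, fy, cx, cy):
    """Viewing ray (in camera coordinates) through a pixel of a pinhole camera."""
    u, v = pixel
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


def intersect_ground(ray, normal, offset):
    """Intersect a ray from the camera center with the plane normal . X = offset.

    Returns None if the ray does not hit the plane in front of the camera.
    """
    denom = sum(n * r for n, r in zip(normal, ray))
    if denom <= 1e-12:
        return None
    scale = offset / denom
    return tuple(scale * r for r in ray)


# Hypothetical setup: level camera 1 m above the ground, y axis pointing down.
FX = FY = 500.0
CX, CY = 320.0, 240.0
GROUND_NORMAL = (0.0, 1.0, 0.0)
CAMERA_HEIGHT = 1.0


def feet_to_3d(pixel):
    ray = backproject(pixel, FX, FY, CX, CY)
    return intersect_ground(ray, GROUND_NORMAL, CAMERA_HEIGHT)


near = feet_to_3d((320.0, 340.0))          # feet 100 rows below the horizon
far = feet_to_3d((320.0, 250.0))           # feet 10 rows below the horizon
far_shifted = feet_to_3d((320.0, 252.0))   # same pedestrian, 2-pixel shift
print(near[2], far[2], far_shifted[2])
```

In this toy setup a 2-pixel shift of the feet changes the estimated depth of the distant pedestrian from 50 m to about 41.7 m, which illustrates why the error grows with the distance to the camera.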

We do the same velocity analysis for these sequences,
shown in Fig. 4, and observe that some of the velocities
reach 200 to 700 m/s, indicating a clear problem in the
3D position estimation. Looking at the mean velocity per
pedestrian, we can observe very high velocities, especially
for the ETH-Bahnhof and ETH-Linthescher sequences.
These are mostly pedestrians walking far away from 
the camera and usually visible for a short period of 
time. This is further observable in the maps of Fig. 5,
where we can see that these incorrect peak velocities 
are found mostly in the region far away (~ 10 m) 
from the camera. This again illustrates the challenge of 
obtaining accurate 3D information from 2D bounding 
boxes, simply due to the nature of projective geometry. 
For these sequences, the added inaccuracy introduced by 
the automatic ground plane angle estimation makes the 
3D information unreliable, which is why we decided not 
to include these sequences in the 3D benchmark. 


As future work, we plan on using the additional 
view provided for these sequences to strengthen the 3D 
estimation. Ideally and for all sequences, the pedestrian's 
feet should be annotated directly, since in general annotations
for 2D and 3D tracking purposes may differ.
Further annotation issues are discussed in Sec. I5H 

3.3 Detections 

To detect pedestrians in all images, we use the recent object
detector implementation of Dollar et al. [16], which is
based on aggregated channel features (ACF). We rely on
the default parameters and the pedestrian model trained
on the INRIA dataset [14], rescaled with a factor of 0.6 to
enable the detection of smaller pedestrians. The minimal
bounding box height in our benchmark is 59 pixels. The
detector performance along with three sample frames is
depicted in Fig. 6 for both the training and the test set of
the benchmark. Note that the recall does not reach 100%
because of the non-maximum suppression applied.

Obviously, we cannot (nor necessarily want to) prevent
anyone from using a different set of detections,
or even rely on a different set of features altogether. 
However, we require that this is noted as part of the 
tracker's description and is also displayed in the ratings 
table. 

3.4 Data format 

All images were converted to JPEG and named sequentially
to a 6-digit file name (e.g. 000001.jpg). Detection
and annotation files are simple comma-separated value
(CSV) files. Each line represents one object instance, and
it contains 10 values as shown in Tab. 2.

Fig. 3: Top row: Speed distributions in image space for the PETS09-S2L1, PETS09-S2L2, AVG-TownCentre and
TUD-Stadtmitte sequences, respectively. Note that the scale is different for each image. Bottom row: Sample frame for
each sequence.

Fig. 4: Top row: Pedestrian speed histograms per sequence. Bottom row: Mean speed per pedestrian per sequence.

The first number indicates in which frame the object
appears, while the second number identifies that object
as belonging to a trajectory by assigning a unique ID (set
to -1 in a detection file, as no ID is assigned yet). Each
object can be assigned to only one trajectory. The next
four numbers indicate the position of the bounding box
of the pedestrian in 2D image coordinates. The position
is indicated by the top-left corner as well as width
and height of the bounding box. This is followed by a
single number, which in case of detections denotes their
confidence score. The last three numbers indicate the 3D
position in real-world coordinates of the pedestrian. This
position represents the feet of the person. In the case of
2D tracking, these values will be ignored and can be left
at -1.

An example of such a 2D detection file is:

    1, -1, 794.2, 47.5, 71.2, 174.8, 67.5, -1, -1, -1
    1, -1, 164.1, 19.6, 66.5, 163.2, 29.4, -1, -1, -1
    1, -1, 875.4, 39.9, 25.3, 145.0, 19.6, -1, -1, -1
    2, -1, 781.7, 25.1, 69.2, 170.2, 58.1, -1, -1, -1


For the ground truth and results files, the 7th value
(confidence score) acts as a flag whether the entry is to
be considered. A value of 0 means that this particular 
instance is ignored in the evaluation, while a value of 1 
is used to mark it as active. 

























































































Fig. 5: Top row: Speed distributions in image space for the ETH-Bahnhof, ETH-Sunnyday, ETH-Linthescher and
ETH-Jelmoli sequences, respectively. Note that the scale is different for each image. Bottom row: Sample frame for
each sequence.



(a) Detection performance of [16] 


(b) ADL-Rundle-8 


(c) Venice-1 


(d) KITTI-16 


Fig. 6: (a) The performance of the provided detection bounding boxes evaluated on the training (blue) and the
test (red) set. The circle indicates the operating point (i.e. the input detection set) for the trackers. (b-d) Exemplar
detection results.


An example of such an annotation 2D file is: 


    1, 1, 794.2, 47.5, 71.2, 174.8, 1, -1, -1, -1
    1, 2, 164.1, 19.6, 66.5, 163.2, 1, -1, -1, -1
    1, 3, 875.4, 39.9, 25.3, 35.0, 0, -1, -1, -1
    2, 1, 781.7, 25.1, 69.2, 170.2, 1, -1, -1, -1


In this case, there are 2 pedestrians in the first frame of
the sequence, with identity tags 1 and 2. The third pedestrian
is too small and therefore not considered, which is
indicated by a flag value (7th value) of 0. In the second
frame, we can see that pedestrian 1 remains in the scene.
Note that since this is a 2D annotation file, the 3D
positions of the pedestrians are ignored and are therefore
set to -1. Note also that all values including the bounding box
are 1-based, i.e. the top left corner corresponds to (1,1).

To obtain a valid result for the entire benchmark,
a separate CSV file following the format described
above must be created for each sequence and called
"Sequence-Name.txt". All files must be compressed
into a single zip file that can then be uploaded
to be evaluated.
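To make the file format concrete, the following is a minimal sketch of a reader for the ten-value lines described above. The names `MotEntry` and `parse_mot_text` are our own illustrative choices and are not part of the benchmark's development kit; the example text is the annotation file shown earlier.

```python
import csv
import io
from typing import List, NamedTuple


class MotEntry(NamedTuple):
    """One line of a detection, annotation, or result file (hypothetical helper)."""
    frame: int
    track_id: int   # -1 in detection files
    left: float
    top: float
    width: float
    height: float
    conf: float     # detector score, or 0/1 flag for ground truth and results
    x: float        # 3D world coordinates, -1 if unused
    y: float
    z: float


def parse_mot_text(text: str) -> List[MotEntry]:
    """Parse comma-separated lines with exactly ten values each."""
    entries = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip empty lines
        if len(row) != 10:
            raise ValueError(f"expected 10 values, got {len(row)}: {row}")
        values = [float(v) for v in row]
        entries.append(MotEntry(int(values[0]), int(values[1]), *values[2:]))
    return entries


ANNOTATION_EXAMPLE = """\
1, 1, 794.2, 47.5, 71.2, 174.8, 1, -1, -1, -1
1, 2, 164.1, 19.6, 66.5, 163.2, 1, -1, -1, -1
1, 3, 875.4, 39.9, 25.3, 35.0, 0, -1, -1, -1
2, 1, 781.7, 25.1, 69.2, 170.2, 1, -1, -1, -1
"""

entries = parse_mot_text(ANNOTATION_EXAMPLE)
# Entries whose 7th value is 0 are ignored in the evaluation.
active = [e for e in entries if e.conf != 0]
print(len(entries), len(active))
```

For a detection file the same reader applies; the 7th value is then the detector confidence rather than a flag, so the filter on `conf` should not be used.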


3.5 Expansion through crowdsourcing 

We foresee a yearly update of the benchmark datasets in 
order to include new, more challenging sequences and 
eventually remove outdated or repetitive sequences. 
The goal is to push forward research in multi-target 
tracking by increasing the difficulty of the data as 
new, more accurate methods are proposed by the 
community. We want to make a call to the community 
to share their sequences, detections or annotations to 
the benchmark, so as to include a large variety of data. 
More importantly, the goal is to extend the types of
data to the following categories:

• Tracking of cars, bicycles, etc. in outdoor scenarios; 

• Biological data such as cell, bird or fish tracking; 

• Sports data: basketball games, hockey or soccer; 

• Large-scale multi-view sequences. 

Sequences of such scenarios do exist in the literature,
e.g. thousands of bats filmed using thermal cameras
[53], cell tracking data [32], basketball game [11], hockey
Position  Name                 Description
1         Frame number         Indicates at which frame the object is present
2         Identity number      Each pedestrian trajectory is identified by a unique ID (-1 for detections)
3         Bounding box left    x-coordinate of the top-left corner of the pedestrian bounding box
4         Bounding box top     y-coordinate of the top-left corner of the pedestrian bounding box
5         Bounding box width   Width in pixels of the pedestrian bounding box
6         Bounding box height  Height in pixels of the pedestrian bounding box
7         Confidence score     Indicates how confident the detector is that this instance is a pedestrian.
                               For the ground truth and results, it acts as a flag whether the entry is
                               to be considered.
8         x                    3D x position of the pedestrian in real-world coordinates (-1 if not available)
9         y                    3D y position of the pedestrian in real-world coordinates (-1 if not available)
10        z                    3D z position of the pedestrian in real-world coordinates (-1 if not available)


TABLE 2: Data format for the input and output files, both for detection and annotation files. 


game [38], several indoor multi-view sequences (n),
or recent large-scale multi-view sequences containing
millions of pedestrian trajectories [4]. We encourage the
community to contact us with any interesting data that
they would like included in the benchmark structure,
and we commit to extending the benchmark to other
interesting and relevant categories. Each category will
have its own separate submissions.


4 Baseline Methods 


As a starting point for the benchmark, we have included
a number of recent multi-target tracking approaches as
baselines, which we briefly outline for completeness;
we refer the reader to the respective publications
for more details. Note that we have used the publicly
available code and trained all of them in the same
way (cf. Sec. 4.1 and footnote 4). However, we explicitly state that the
provided numbers may not represent the best possible
performance for each method, as could be achieved by
the authors themselves. Table 3 lists current benchmark
results for all baselines as well as for all anonymous
entries at the time of writing of this manuscript.


4.1 Training and testing 

Most of the available tracking approaches do not include
a learning (or training) algorithm to determine the set of
model parameters for a particular dataset. Therefore, we
follow a simplistic search scheme for all baseline methods
to find a good setting for our benchmark. To that
end, we take the default parameter set $\Theta := \{\theta_1, \ldots, \theta_P\}$
as suggested by the authors, where $P$ is the number of
free parameters for each method. We then perform 100
independent runs on the training set with varying parameters.
In each run, a parameter value $\theta_i$ is uniformly
sampled around its default value in the range $[\frac{1}{2}\theta_i, 2\theta_i]$.
Finally, the parameter set $\Theta^*$ that achieved the highest
MOTA score across all 100 runs (cf. Sec. 5.2.3) is taken
as the optimal setting and run once on the test set. The
optimal parameter set is stated in the description entry
for each baseline method on the benchmark website.
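The search scheme can be sketched as follows. This is only an illustration under stated assumptions: `evaluate_mota` stands in for running a tracker on the training set and returning its MOTA, `toy_mota` is a made-up objective, the parameter names `alpha` and `beta` are hypothetical, and the sampling range of half to twice the default value is our reading of the description above.

```python
import random
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


def random_search(defaults: Params,
                  evaluate_mota: Callable[[Params], float],
                  runs: int = 100,
                  seed: int = 0) -> Tuple[Params, float]:
    """Sample each parameter uniformly in [0.5 * default, 2 * default], keep the best run."""
    rng = random.Random(seed)
    best_params, best_mota = dict(defaults), float("-inf")
    for _ in range(runs):
        candidate = {name: rng.uniform(0.5 * value, 2.0 * value)
                     for name, value in defaults.items()}
        mota = evaluate_mota(candidate)
        if mota > best_mota:
            best_params, best_mota = candidate, mota
    return best_params, best_mota


def toy_mota(params: Params) -> float:
    """Made-up stand-in for a tracker evaluation on the training set."""
    return 100.0 - (params["alpha"] - 1.5) ** 2 - (params["beta"] - 8.0) ** 2


defaults = {"alpha": 1.0, "beta": 10.0}  # hypothetical default parameters
best_params, best_mota = random_search(defaults, toy_mota)
print(best_params, best_mota)
```

Only the selected parameter set would then be run once on the test set, so the test data never influences the search.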

4. Except for TBD, which does not disclose any obvious free parameters.


4.2 dp_nms: Network flow tracking

Since its original publication [57], a large number of
methods that are based on the network flow formulation
have appeared in the literature [13], [31], [34], [41], [50].
The basic idea is to model the tracking as a graph,
where each node represents a detection and each edge
represents a transition between two detections. Special
source and sink nodes allow spawning and absorbing
trajectories. A solution is obtained by finding the minimum
cost flow in the graph. Multiple assignments and
track splitting are prevented by introducing binary and
linear constraints.

Here we use two solvers: (i) the successive shortest
paths approach [41] that employs dynamic programming
with non-maxima suppression, termed DP_NMS; (ii) a
linear programming solver that we use for both 2D and
3D data (lp2d and lp3d, respectively), and that appears
as a baseline in [29]. This solver uses the Gurobi Library.

4.3 cem: Continuous energy minimization 

CEM [36] formulates the problem in terms of a high-dimensional
continuous energy. Here, we use the basic
approach [7] without explicit occlusion reasoning or
appearance model. The target state X is represented by
continuous x,y coordinates in all frames. The energy
E(X) is made up of several components, including a
data term to keep the solution close to the observed data
(detections), a dynamic model to smooth the trajectories,
an exclusion term to avoid collisions, a persistence term
to reduce track fragmentations, and a regularizer. The resulting
energy is highly non-convex and is minimized in
an alternating fashion using conjugate gradient descent
and deterministic jump moves.

4.4 smot: Similar moving objects 

The Similar Multi-Object Tracking (SMOT) approach [15]
specifically targets situations where target appearance is 
ambiguous and rather concentrates on using the motion 
as a primary cue for data association. Tracklets with 
similar motion are linked to longer trajectories using 
the generalized linear assignment (GLA) formulation. 
The motion similarity and the underlying dynamics of 
a tracklet are modeled as the order of a linear regressor 
approximating that tracklet. 
4.5 tbd: Tracking-by-detection

This two-stage tracking-by-detection (tbd) approach
[21], [56] is part of a larger traffic scene understanding
framework and employs a rather simple data association
technique. The first stage links overlapping detections
with similar appearance in successive frames into tracklets.
The second stage aims to bridge occlusions of up to
20 frames. Both stages employ the Hungarian algorithm
to optimally solve the matching problem. Note that we
did not re-train this baseline but rather used the original
implementation and parameters provided.


5 Evaluation 

Evaluating multiple object tracking to this day remains a
surprisingly difficult task. Even though many measures
have been proposed in the past [12], [33], [44], [46],
[47], [52], comparing a new method against prior art is
typically not straightforward. As discussed in some of
our previous work [37], the reasons for that are diverse,
ranging from ambiguous ground truth to imprecise metric
definitions and implementation variations. In this section
we will describe the entire evaluation procedure of our
benchmark in detail.


4.6 sfm: Social forces for tracking 

Most tracking systems work with the assumption that
the motion model for each target is independent, but in
reality, a pedestrian follows a series of social rules, i.e. is
subject to social forces according to other moving targets
around him/her. These have been defined in what is
called the social force model (SFM) [23], [26] and have recently
been applied to multiple people tracking. For the
3D benchmark we include two baselines that include a
few hand-designed force terms, such as collision avoidance
or group attraction. The first method (KALMANSFM) [40]
includes those in an online predictive Kalman filter approach
while the second (lpsfm) [30] includes the social
forces in a Linear Programming framework as described
in Sec. 4.2. For the 2D benchmark, we include a recent
algorithm (MOTICON) [29], which learns an image-based
motion context that encodes the pedestrian's reaction to
the environment, i.e., other moving objects. The motion
context, created from low-level image features, leads to
a much richer representation of the physical interactions
between targets compared to hand-specified social force
models. This allows for a more accurate prediction of the
future position of each pedestrian in image space, information
that is then included in a Linear Programming
framework for multi-target tracking.

4.7 tc_odal: Tracklet confidence 

Robust Online Multi-Object Tracking based on Tracklet
Confidence and Online Discriminative Appearance
Learning, or TC_ODAL [8], is the only online method
among the baselines. It proceeds in two stages. First,
close detections are linked to form a set of short, reliable
tracklets. This so-called local association allows one to
progressively aggregate confident tracklets. In case of
occlusions or missed detections, the tracklet confidence
value is decreased and a global association is employed
to bridge longer occlusion gaps. Both association techniques
are formulated as bipartite matching and tackled
with the Hungarian algorithm.

Another prominent component of TC_ODAL is online
appearance learning. To that end, positive samples are
collected from tracklets with high confidence and incremental
linear discriminant analysis (ILDA) is employed
to update the appearance model in an online fashion.


5.1 Annotations 

As in many other applications, multi-target tracking
requires a set of labeled (or annotated) videos in order to
quantitatively evaluate the performance of a particular
approach. Unfortunately, human supervision is necessary
to obtain a reliable set of such labels, the so-called ground
truth. Depending on factors like object count, image
quality, or the level of detail, annotating video data can
be a rather tedious task. This is one of the reasons why
there exist only relatively few datasets with publicly
available ground truth.

For the majority of the sequences contained in our
benchmark, we employ the publicly available ground
truth. The 6 new sequences (ADL-Rundle-* and Venice-*)
were annotated by us using the VATIC annotation tool
[49]. We provide the ground truth for the training set;
however, to reduce overfitting on unseen data, the annotations
for the test sequences are withheld. Annotation
samples are illustrated in Fig. 7.


5.1.1 Variation in the annotations 

Publicly available annotations contain a relatively large
amount of variation. Since, as of now, we rely on different
sources for our annotations, we cannot state that
they all follow a set of common rules. Some bounding
boxes enclose the whole pedestrian, including all the
limbs, which can lead to bounding boxes that change
noticeably in size depending on the pedestrian's pose,
as we can see in Fig. 8(a), blue pedestrian vs. yellow
pedestrian. In Fig. 8(b) we see that bounding boxes are
not always centered exactly on the pedestrians, which
could cause small shifts in the 3D position estimation.
Another common problem is that bounding boxes for
pedestrians that are close to the camera are usually very
tight around the pedestrian's silhouette compared to
pedestrians far away, as we can see in Fig. 8(c), blue
bounding box vs. yellow bounding box. Occlusions are
also handled differently among sequences. While some
annotations follow pedestrians even under full occlusion
[20], others create a new trajectory once the pedestrian
reappears [18].

Recently, a thorough study on face detection benchmarks
[35] also showed that annotation policies vary
greatly among sequences and datasets. It also showed
that adapting the evaluation method to be more robust
Method     AvgRank  MOTA         MOTP  FAF  MT(%)  ML(%)  FP     FN     IDsw  rel.ID  FM    rel.FM  Hz     Ref.

2D MOT 2015
MOTICON    9.3      23.1 ±16.4   70.9  1.8  4.7    52.0   10404  35844  1018  24.4    1061  25.5    1.4    [29]
lp2d       8.3      19.8 ±14.2   71.2  2.0  6.7    41.2   11580  36045  1649  39.9    1712  41.4    112.1  baseline
CEM        9.2      19.3 ±17.5   70.7  2.5  8.5    46.5   14180  34591  813   18.6    1023  23.4    1.1    [36]
RMOT       10.6     18.6 ±17.5   69.6  2.2  5.3    53.3   12473  36835  684   17.1    1282  32.0    7.9
SMOT       10.7     18.2 ±10.3   71.2  1.5  2.8    54.8   8780   40310  1148  33.4    2132  62.0    2.7    [15]
TBD        12.3     15.9 ±17.6   70.9  2.6  6.4    47.9   14943  34777  1939  44.7    1963  45.2    0.7    [21], [56]
TC_ODAL    12.8     15.1 ±15.0   70.5  2.2  3.2    55.8   12970  38538  637   17.1    1716  46.0    1.7
DP_NMS     10.7     14.5 ±13.9   70.8  2.3  6.0    40.8   13171  34814  4537  104.7   3090  71.3    444.8

3D MOT 2015
LPSFM      1.7      35.9 ±06.3   54.0  2.3  13.8   21.6   2031   8206   520   10.2    601   11.8    8.4    [30]
lp3d       2.0      35.9 ±11.1   53.3  4.0  20.9   16.4   3588   6593   580   9.6     659   10.9    83.5   baseline
KALMANSFM  2.3      25.0 ±08.5   53.6  3.6  6.7    14.6   3161   7599   1838  33.6    1686  30.8    30.6   [40]

TABLE 3: Quantitative results on all baselines. 



(a) ADL-Rundle-1 (b) Venice-2 (c) PETS09-S2L2 

Fig. 7: Ground truth (manually annotated) bounding boxes on three sequences. 


against annotation variation, together with a reannotation of the
data with a fixed set of rules, changed the performance
of many state-of-the-art methods.

Even though annotations based on a standardized
policy are not available in the benchmark yet, the size
and variation of the benchmark already
exceed those of existing benchmarks significantly. Nonetheless,
and following the work in [35], we commit to standardizing
the set of annotations for all sequences following
a common strict set of rules. These annotations will be
published in the second release of the benchmark.


5.2 Evaluation metrics 

In the past, a large number of metrics for quantitative
evaluation of multiple target tracking have been proposed
[12], [33], [44], [46], [47], [52]. Choosing "the right"
one is largely application dependent and the quest for a
unique, general evaluation metric is still ongoing. On the
one hand, it is desirable to summarize the performance
into one single number to enable a direct comparison. On
the other hand, one might not want to lose information
about the individual errors made by the algorithms and
provide several performance estimates, which precludes
a clear ranking.

Following the recent trend [8], [36], [51], we employ
two sets of measures that have established themselves in
the literature: the CLEAR metrics proposed by Stiefelhagen
et al. [47], and a set of track quality measures
introduced by Wu and Nevatia [52]. The evaluation
scripts used in our benchmark are publicly available (see footnote 5).

5.2.1 Tracker-to-target assignment 
There are two common prerequisites for quantifying the
performance of a tracker. One is to determine for each
hypothesized output whether it is a true positive (TP)
that describes an actual (annotated) target, or whether
the output is a false alarm (or false positive, FP). This
decision is typically made by thresholding based on
a defined distance (or dissimilarity) measure $d$ (see
Sec. 5.2.2). A target that is not matched by any hypothesis
is a false negative (FN). A good result is expected to
have as few FPs and FNs as possible. Next to the
absolute numbers, we also show the false positive ratio
measured by the number of false alarms per frame (FAF),
sometimes also referred to as false positives per image
(FPPI) in the object detection literature.

Obviously, it may happen that the same target is 
covered by multiple outputs. The second prerequisite 
before computing the numbers is then to establish the 
correspondence between all annotated and hypothesized 
objects under the constraint that a true object should be 
recovered at most once, and that one hypothesis cannot 
account for more than one target. 

For the following, we assume that each ground truth 
trajectory has one unique start and one unique end 
point, i.e. that it is not fragmented. Note that the current 
evaluation procedure does not explicitly handle target 
re-identification. In other words, when a target leaves 


5. http://motchallenge.net/devkit 














(a) TUD-Campus 


(b) AVG-TownCentre 


(c) ETH-Crossing 


Fig. 8: Publicly available ground truth bounding boxes on three sequences. 





Fig. 9: Four cases illustrating tracker-to-target assignments. (a) An ID switch occurs when the mapping switches
from the previously assigned red track to the blue one. (b) A track fragmentation is counted in frame 3 because
the target is tracked in frames 1-2, then interrupts, and then reacquires its 'tracked' status at a later point. A new
(blue) track hypothesis also causes an ID switch at this point. (c) Although the tracking result is reasonably good,
an optimal single-frame assignment in frame 1 is propagated through the sequence, causing 5 missed targets (FN)
and 4 false positives (FP). Note that no fragmentations are counted in frames 3 and 6 because tracking of those
targets is not resumed at a later point. (d) A degenerate case illustrating that target re-identification is not handled
correctly. An interrupted ground truth trajectory will typically cause a fragmentation. Also note the less intuitive
ID switch, which is counted because blue is the closest target in frame 5 that is not in conflict with the mapping
in frame 4.


the field-of-view and then reappears, it is treated as an
unseen target with a new ID. As proposed in [47], the
optimal matching is found using Munkres' (a.k.a. Hungarian)
algorithm. However, dealing with video data,
this matching is not performed independently for each
frame, but rather considering a temporal correspondence.
More precisely, if a ground truth object $i$ is
matched to hypothesis $j$ at time $t-1$ and the distance
(or dissimilarity) between $i$ and $j$ in frame $t$ is below
$t_d$, then the correspondence between $i$ and $j$ is carried
over to frame $t$ even if there exists another hypothesis
that is closer to the actual target. A mismatch error (or
equivalently an identity switch, IDSW) is counted if a
ground truth target $i$ is matched to track $j$ and the last
known assignment was $k \neq j$. Note that this definition
of ID switches is more similar to [33] and stricter than
the original one [47]. Also note that, while it is certainly
desirable to keep the number of ID switches low,
their absolute number alone is not always expressive enough
to assess the overall performance, but should rather be
considered in relation to the number of recovered targets.
The intuition is that a method that finds twice as many
trajectories will almost certainly produce more identity
switches. For that reason, we also state the relative
number of ID switches, which is computed as IDSW / Recall.
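The matching with temporal carry-over can be illustrated with a small sketch. It is not the benchmark's evaluation script: exhaustive search is used here as a stand-in for the Hungarian algorithm (feasible only for a handful of objects), positions are one-dimensional for brevity, and the names `best_assignment`, `match_frame` and `evaluate` as well as the strict comparison against the threshold are our assumptions.

```python
def best_assignment(gts, hyps, cost, t_d):
    """Exhaustive stand-in for Hungarian matching: most pairs first, then lowest total cost."""
    best_key, best_pairs = (0, 0.0), {}

    def extend(i, used, pairs, total):
        nonlocal best_key, best_pairs
        if i == len(gts):
            key = (-len(pairs), total)
            if key < best_key:
                best_key, best_pairs = key, dict(pairs)
            return
        extend(i + 1, used, pairs, total)  # leave gts[i] unmatched
        for h in hyps:
            c = cost(gts[i], h)
            if h not in used and c < t_d:
                pairs[gts[i]] = h
                extend(i + 1, used | {h}, pairs, total + c)
                del pairs[gts[i]]

    extend(0, frozenset(), {}, 0.0)
    return best_pairs


def match_frame(gt, hyp, prev, dist, t_d):
    """gt, hyp: id -> position; prev: ground truth id -> hypothesis id at the previous frame."""
    # Carry over correspondences that are still within the threshold.
    matches = {g: h for g, h in prev.items()
               if g in gt and h in hyp and dist(gt[g], hyp[h]) < t_d}
    taken = set(matches.values())
    free_g = [g for g in gt if g not in matches]
    free_h = [h for h in hyp if h not in taken]
    matches.update(best_assignment(free_g, free_h,
                                   lambda g, h: dist(gt[g], hyp[h]), t_d))
    return matches


def evaluate(frames, dist, t_d):
    """Return total (FN, FP, IDSW) over a list of (gt, hyp) frames."""
    prev, last_known = {}, {}
    fn = fp = idsw = 0
    for gt, hyp in frames:
        matches = match_frame(gt, hyp, prev, dist, t_d)
        for g, h in matches.items():
            if g in last_known and last_known[g] != h:
                idsw += 1  # matched to a different track than last time
            last_known[g] = h
        fn += len(gt) - len(matches)
        fp += len(hyp) - len(matches)
        prev = matches
    return fn, fp, idsw


frames = [
    ({1: 0.0}, {"a": 0.1, "b": 0.6}),  # target 1 is matched to a; b is a false positive
    ({1: 1.0}, {"a": 1.8, "b": 1.1}),  # a is kept although b is closer
    ({1: 2.0}, {"b": 2.1}),            # a is gone; matching b counts as an ID switch
]
result = evaluate(frames, lambda p, q: abs(p - q), t_d=1.0)
print(result)
```

In the second frame the earlier correspondence survives because its distance is still below the threshold, which is exactly the behaviour that distinguishes this procedure from an independent per-frame matching.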

These relationships are illustrated in Fig. 9. For simplicity,
we plot ground truth trajectories with dashed
curves, and the tracker output with solid ones, where
the color represents a unique target ID. The grey areas
indicate the matching threshold (see next section). Each
true target that has been successfully recovered in one
particular frame is represented with a filled black dot
with a stroke color corresponding to its matched hypothesis.
False positives and false negatives are plotted as
empty circles. See figure caption for more details.

After determining true matches and establishing the
correspondences it is now possible to compute the metrics.
We do so by concatenating all test sequences and
evaluating on the entire benchmark. This is in general
more meaningful than averaging per-sequence
figures due to the large variation in the number of
targets.

5.2.2 Distance measure 

To determine how close a tracker hypothesis is to the
actual target, we will distinguish two cases as described
below (see also Fig. 10).

Fig. 10: The closeness between the tracker output (blue)
and the true location of a target (black dashed) can be
computed as a bounding box overlap or as Euclidean
distance in world coordinates.

2D. In the most general case, the relationship between
ground truth objects and a tracker output is established
using bounding boxes on the image plane. Similar to
object detection [19], the intersection over union (a.k.a.
the Jaccard index) is usually employed as the similarity
criterion, while the threshold $t_d$ is set to 0.5 or 50%.
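As a minimal sketch of the 2D criterion, the intersection over union of two boxes given as (left, top, width, height) can be computed as below. The function names are illustrative, box areas are treated as continuous (no pixel-inclusive +1 correction), and counting an overlap of exactly 0.5 as a match is our assumption.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, width, height)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union (Jaccard index) of two axis-aligned boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def is_match_2d(gt_box: Box, hyp_box: Box, t_d: float = 0.5) -> bool:
    """A hypothesis can explain a target only if the overlap reaches the threshold."""
    return iou(gt_box, hyp_box) >= t_d


print(iou((0, 0, 10, 10), (5, 0, 10, 10)))
print(is_match_2d((0, 0, 10, 10), (1, 1, 10, 10)))
```

A box shifted by half its width overlaps its original by only one third, so it would not count as a true positive under the 50% threshold.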

3D. When both locations, that of the tracker and that
of the ground truth, are available as points in world
coordinates, it is more sensible to directly compute the
performance in 3D. To that end, $d$ simply corresponds
to the Euclidean distance and $t_d$ is set to 1 meter for
pedestrian tracking.


5.2.3 Multiple Object Tracking Accuracy 

The MOTA [47] is perhaps the most widely used figure
to evaluate a tracker's performance. The main reason for
this is its expressiveness, as it combines the three sources of
errors defined above:

$$\text{MOTA} = 1 - \frac{\sum_t \left(\text{FN}_t + \text{FP}_t + \text{IDSW}_t\right)}{\sum_t \text{GT}_t}, \qquad (1)$$


5.2.4 Multiple Object Tracking Precision 

The Multiple Object Tracking Precision is the average
dissimilarity between all true positives and their corresponding
ground truth targets. For bounding box overlap,
this is computed as

$$\text{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}, \qquad (2)$$

where $c_t$ denotes the number of matches in frame $t$
and $d_{t,i}$ is the bounding box overlap of target $i$ with
its assigned ground truth object. MOTP thereby gives
the average overlap between all correctly matched hypotheses
and their respective objects and ranges between
$t_d := 50\%$ and $100\%$.

It is important to point out that MOTP is a measure of
localization precision, not to be confused with the positive
predictive value or relevance in the context of precision /
recall curves used, e.g., in object detection.

As we can see in Tab. 3, MOTP shows a remarkably
low variation across different methods, ranging between
69.6% and 71.6%. The main reason for this is that this
localization measure is primarily dominated by the detections
and the annotations and is less influenced by
the actual tracker output.

If computed in 3D, the definition changes slightly to

$$\text{MOTP}_{3D} = 1 - \frac{\sum_{t,i} d_{t,i}}{t_d \cdot \sum_t c_t}. \qquad (3)$$

Note that here it is normalized to be between 0 and 100%.
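Both variants of MOTP reduce to simple averages over all matched pairs, as the following sketch shows. The input layout (one list of per-match values for each frame) and the function names are our own assumptions, and the numbers are made up for illustration.

```python
from typing import List


def motp_2d(overlaps_per_frame: List[List[float]]) -> float:
    """Average bounding box overlap over all matches, following Eq. (2)."""
    matches = sum(len(frame) for frame in overlaps_per_frame)
    if matches == 0:
        raise ValueError("MOTP is undefined without any match")
    return sum(sum(frame) for frame in overlaps_per_frame) / matches


def motp_3d(distances_per_frame: List[List[float]], t_d: float = 1.0) -> float:
    """One minus the average distance normalized by the threshold, following Eq. (3)."""
    matches = sum(len(frame) for frame in distances_per_frame)
    if matches == 0:
        raise ValueError("MOTP is undefined without any match")
    total = sum(sum(frame) for frame in distances_per_frame)
    return 1.0 - total / (t_d * matches)


# Made-up values: two matches in the first frame, one in the second.
overlap_score = motp_2d([[0.8, 0.6], [0.7]])
distance_score = motp_3d([[0.2, 0.4], [0.3]], t_d=1.0)
print(overlap_score, distance_score)
```

Since every matched pair has an overlap of at least the threshold (or a distance of at most the threshold in 3D), the 2D value cannot drop below 50% and the 3D value cannot drop below 0.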

5.2.5 Track quality measures 


where $t$ is the frame index and $\text{GT}_t$ is the number of
ground truth objects in frame $t$. We report the percentage MOTA
in the range $(-\infty, 100]$ in our benchmark. Note that MOTA can also
be negative in cases where the number of errors made
by the tracker exceeds the number of all objects in the
scene.
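A short sketch of the MOTA computation from per-frame error counts follows; the function name and the example counts are illustrative only.

```python
from typing import Sequence


def mota(fn: Sequence[int], fp: Sequence[int],
         idsw: Sequence[int], gt: Sequence[int]) -> float:
    """MOTA from per-frame false negatives, false positives, ID switches and object counts."""
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground truth objects")
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / total_gt


# Three frames with four annotated targets each (made-up counts).
score = mota(fn=[1, 0, 2], fp=[0, 1, 0], idsw=[0, 0, 1], gt=[4, 4, 4])
# More errors than objects yields a negative score.
negative_score = mota(fn=[2], fp=[3], idsw=[0], gt=[2])
print(100 * score, 100 * negative_score)
```

Because errors are summed over all frames before dividing, concatenated sequences with many targets contribute more to the score than sparse ones.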

Even though the MOTA score gives a good indication
of the overall performance, it is highly debatable
whether this number alone can serve as a single performance
measure.

Robustness. One incentive behind compiling this
benchmark was to reduce dataset bias by keeping the
data as diverse as possible. The main motivation is to
challenge state-of-the-art approaches and analyze their
performance in unconstrained environments and on unseen
data. Our experience shows that most methods can
be heavily overfitted on one particular dataset, but are
not general enough to handle an entirely different setting
without a major change in parameters or even in the
model.

To indicate the robustness of each tracker over all
benchmark sequences, we show the standard deviation
of its MOTA score.


Each ground truth trajectory can be classified as mostly 
tracked (MT), partially tracked (PT), and mostly lost 
(ML). This is done based on how much of the trajectory is 
recovered by the tracking algorithm. A target is mostly 
tracked if it is successfully tracked for at least 80% of 
its life span. Note that it is irrelevant for this measure 
whether the ID remains the same throughout the track. 
If a track is recovered for less than 20% of its
total length, it is said to be mostly lost (ML). All other
tracks are partially tracked. A high number of MT and
a low number of ML trajectories is desirable. We report MT and ML as the ratio
of mostly tracked and mostly lost targets to the total
number of ground truth trajectories.
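The classification can be sketched as follows, assuming that each ground truth trajectory is summarized by one boolean per frame of its life span indicating whether it was matched to any hypothesis. The function names and the toy trajectories are illustrative.

```python
from typing import Dict, List, Sequence


def classify_trajectory(tracked: Sequence[bool]) -> str:
    """Return 'MT', 'PT' or 'ML' from per-frame tracked flags (identity is irrelevant)."""
    ratio = sum(tracked) / len(tracked)
    if ratio >= 0.8:
        return "MT"
    if ratio < 0.2:
        return "ML"
    return "PT"


def track_quality(trajectories: List[Sequence[bool]]) -> Dict[str, float]:
    """Fraction of mostly tracked, partially tracked and mostly lost trajectories."""
    counts = {"MT": 0, "PT": 0, "ML": 0}
    for flags in trajectories:
        counts[classify_trajectory(flags)] += 1
    return {label: count / len(trajectories) for label, count in counts.items()}


toy = [
    [True] * 8 + [False] * 2,   # tracked for 80% of its life span
    [True] * 5 + [False] * 5,   # tracked for 50%
    [True] * 1 + [False] * 9,   # tracked for 10%
]
quality = track_quality(toy)
print(quality)
```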

In certain situations one might be interested in obtaining
long, persistent tracks without gaps of untracked
periods. To that end, the number of track fragmentations
(FM) counts how many times a ground truth trajectory is
interrupted (untracked). In other words, a fragmentation
is counted each time a trajectory changes its status from
tracked to untracked and tracking of that same trajectory
is resumed at a later point. Similarly to the ID switch
ratio (cf. Sec. 5.2.1), we also provide the relative number
of fragmentations as FM / Recall.
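A fragmentation counter along these lines is sketched below, again using one tracked flag per frame of a ground truth trajectory. The helper names are ours. Expressing recall in percent for the relative number is an assumption; it appears consistent with Table 3, where for example 1061 fragmentations and a relative value of 25.5 imply a recall of roughly 41.6.

```python
from typing import Sequence


def count_fragmentations(tracked: Sequence[bool]) -> int:
    """Count interruptions of a trajectory that are followed by resumed tracking."""
    fragmentations = 0
    was_tracked = False  # the target has been tracked in some earlier frame
    in_gap = False       # the target is currently untracked after being tracked
    for is_tracked in tracked:
        if is_tracked:
            if in_gap:
                fragmentations += 1  # tracking resumed after an interruption
            in_gap = False
            was_tracked = True
        elif was_tracked:
            in_gap = True
    return fragmentations


def relative_count(count: int, recall_percent: float) -> float:
    """Relative number of fragmentations (or ID switches): count / recall."""
    return count / recall_percent


print(count_fragmentations([True, True, False, False, True]))
print(count_fragmentations([True, True, False, False]))
print(relative_count(1061, 41.6))
```

A gap at the very end of a trajectory is not counted, which matches the remark in the caption of Fig. 9 that no fragmentation is counted when tracking is not resumed.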
Fig. 11: Each marker represents a tracker's performance
measured by MOTA (x-axis) and its speed measured in
frames per second (FPS) [Hz], i.e. further up and to the right
is better. Real-time ability is assumed at 25 FPS.

5.2.6 Runtime

Most research in multi-target tracking focuses on pushing
the performance towards more accurate results with
fewer errors. However, from the practical point of view,
a method should be able to compute the results in a
reasonable time frame. Of course, 'reasonable' varies
depending on the application. Autonomous vehicles or
tasks in robotics would require real-time functionality;
surveillance assistance may tolerate a certain delay, while
long-term video analysis may allow even much longer
processing times. We demonstrate the relationship between
tracker accuracy and speed in Fig. 11. As we can
see, the fastest approach, DP_NMS, performs worse on
average while the baseline lp2d provides a good balance
between speed and performance.

Note that accurately measuring the efficiency of each
method is not straightforward. All baselines from Sec. 4
were executed on the same hardware (2.6 GHz, 16-core
CPU, 32 GB RAM) and the reported numbers do not
include the detector's time. For all submitted results
we cannot verify the efficiency ourselves and therefore
report the runtime as specified by the respective user.

5.2.7 Tracker ranking 

As we have seen in this section, there are a number of
reasonable performance measures to assess the quality
of a tracking system, which makes it rather difficult to
reduce the evaluation to one single number. To nevertheless
give an intuition on how each tracker performs
compared to its competitors, we compute and show
the average rank for each one by ranking all trackers
according to each metric and then averaging across
all ten performance measures. Interestingly, the average
rank roughly corresponds to the MOTA ordering, which
indicates that the tracking accuracy is a good approximation
of the overall tracker performance.
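The average rank can be sketched as below. For brevity the example uses only three of the measures and three of the 2D baselines from Table 3, so the resulting numbers differ from the AvgRank column; the function name and the arbitrary handling of ties are our assumptions.

```python
from typing import Dict


def average_ranks(scores: Dict[str, Dict[str, float]],
                  higher_is_better: Dict[str, bool]) -> Dict[str, float]:
    """Rank all trackers per metric (1 = best) and average the ranks per tracker."""
    trackers = list(scores)
    totals = {tracker: 0.0 for tracker in trackers}
    for metric, higher in higher_is_better.items():
        ordered = sorted(trackers, key=lambda t: scores[t][metric], reverse=higher)
        for rank, tracker in enumerate(ordered, start=1):
            totals[tracker] += rank  # ties are broken by input order
    return {tracker: totals[tracker] / len(higher_is_better) for tracker in trackers}


# Subset of Table 3 (2D MOT 2015): MOTA, false positives and ID switches.
scores = {
    "MOTICON": {"MOTA": 23.1, "FP": 10404, "IDsw": 1018},
    "CEM": {"MOTA": 19.3, "FP": 14180, "IDsw": 813},
    "SMOT": {"MOTA": 18.2, "FP": 8780, "IDsw": 1148},
}
higher_is_better = {"MOTA": True, "FP": False, "IDsw": False}
ranks = average_ranks(scores, higher_is_better)
print(ranks)
```

Even on this small subset the ordering by average rank agrees with the ordering by MOTA, in line with the observation above.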

6 Conclusion and Future Work 

We have presented a novel platform for evaluating
multi-target tracking approaches. Our centralized benchmark
consists of both existing public videos as well as
new challenging sequences and is open for new submissions.
We believe that this will enable a fairer comparison
and guide research towards developing more generic
methods that perform well in unconstrained environments
and on unseen data.

In the future, we will work on the standardization of the
annotations for all sequences, continue our workshops
and challenges series, and also introduce various other
(sub-)benchmarks to welcome researchers and practitioners
from other disciplines.